Final Edited

Author

Abby Sikora

Question: Should Iowa have won the NCAA National Championship for women’s basketball?

Iowa Dataset

library(tidyverse)
library(readr)
library(here)
library(rvest)

#Iowa
url2 <- "https://www.espn.com/womens-college-basketball/team/stats/_/id/2294/iowa-hawkeyes"

h2 <- read_html(url2)

tab2 <- h2 |> html_nodes("table")

#stats with no names
iowa_df <- tab2[[4]] |> html_table(fill = TRUE)

#names table
iowa_df2 <- tab2[[3]] |> html_table(fill = TRUE)

iowa_stats <- bind_cols(iowa_df2, iowa_df)

#South Carolina

url3 <- "https://www.espn.com/womens-college-basketball/team/stats/_/id/2579/south-carolina-gamecocks"

h3 <- read_html(url3)

tab3 <- h3 |> html_nodes("table")

#stats with no names
sc_df <- tab3[[4]] |> html_table(fill = TRUE)

#names table
sc_df2 <- tab3[[3]] |> html_table(fill = TRUE)

sc_stats <- bind_cols(sc_df2, sc_df)
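The two scraping blocks above follow the same pattern: take the names table and the stats table from one page and bind them side by side. Here is a minimal sketch of that pattern as a reusable function. The helper name `combine_tables()` and the inline toy page are my own assumptions for illustration (the numbers are made up, not ESPN data), and the table positions on the real pages may change over time.

```r
library(rvest)
library(dplyr)

# Hypothetical helper (not part of the original analysis): bind the names table
# and the stats table found at two positions on an already-parsed page.
combine_tables <- function(page, names_idx, stats_idx) {
  tabs <- html_elements(page, "table")
  bind_cols(html_table(tabs[[names_idx]]), html_table(tabs[[stats_idx]]))
}

# Toy page with made-up values, so the sketch runs without a network connection.
toy_page <- minimal_html('
  <table>
    <tr><th>Name</th></tr>
    <tr><td>Player A</td></tr>
    <tr><td>Player B</td></tr>
  </table>
  <table>
    <tr><th>MIN</th><th>PTS</th></tr>
    <tr><td>300</td><td>120</td></tr>
    <tr><td>250</td><td>90</td></tr>
  </table>
')

toy_stats <- combine_tables(toy_page, names_idx = 1, stats_idx = 2)
toy_stats
```

On the real pages, the equivalent call would be something like `combine_tables(read_html(url2), 3, 4)`, assuming the names and stats are still the third and fourth tables.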

The data sets above contain season statistics for two women’s NCAA basketball teams: the Iowa Hawkeyes and the South Carolina Gamecocks.

These data sets include the following variables:

Name: Name of the player

MIN: Minutes Per Game

FGM: Field Goals Made

FGA: Field Goals Attempted

FTM: Free Throws Made

FTA: Free Throws Attempted

3PM: 3-Pointers Made

3PA: 3-Pointers Attempted

PTS: Points

OR: Offensive Rebounds

DR: Defensive Rebounds

REB: Rebounds (Offensive and Defensive Total)

AST: Assists

TO: Turnovers

STL: Steals

BLK: Blocks

First, let’s compare points between the teams. Because the two teams have different numbers of players, this table and the next few use only the top ten players on each team by time played (MIN), which keeps the comparison even.

library(pander)

summary_iowa <- iowa_stats |> 
  arrange(desc(MIN)) |>
  slice(1:10) |>
  summarize(points_avg = mean(PTS))

summary_south_carolina <- sc_stats |>
  arrange(desc(MIN)) |>
  slice(1:10) |>
  summarize(points_avg = mean(PTS))

combined_summary <- bind_rows(
  mutate(summary_iowa, Team = "Iowa"),
  mutate(summary_south_carolina, Team = "South Carolina"))

# Print in a separate step so combined_summary keeps the data frame
pander(combined_summary)

From this table, we see that Iowa’s top ten players average 27.4 more points for the season than South Carolina’s. To analyze this number further, I want to look at Field Goals Made, because the number of points doesn’t tell us much on its own.
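The same "top ten by minutes, then average" step is repeated for each statistic, so it can be sketched once as a function. The helper name `top_minutes_mean()` and the toy numbers are assumptions for illustration, not real team data.

```r
library(dplyr)

# Hypothetical helper: average one statistic over the `top` players by minutes.
top_minutes_mean <- function(stats, stat, top = 10) {
  stats |>
    arrange(desc(MIN)) |>
    slice_head(n = top) |>
    summarize(avg = mean({{ stat }}))
}

# Made-up roster of three players, for illustration only.
toy_team <- tibble(
  Name = c("Player A", "Player B", "Player C"),
  MIN  = c(30, 20, 10),
  PTS  = c(100, 50, 10)
)

# Average points of the two players with the most minutes: (100 + 50) / 2
toy_avg <- top_minutes_mean(toy_team, PTS, top = 2)
toy_avg
```

With the scraped data, a call such as `top_minutes_mean(iowa_stats, PTS)` should give the same kind of summary as the code above, assuming the column names match.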

Now let’s look at Field Goals Made for each team. The difference between FGM and PTS is that FGM counts the baskets made from the field (two-pointers and three-pointers; free throws are counted separately as FTM), while PTS weights each made shot by its point value (1 for a free throw, 2 for a basket from inside the arc, and 3 for a basket from beyond the arc).
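To make the relationship between FGM and PTS concrete: since FGM already includes the three-pointers, points work out to 2 × FGM + 3PM + FTM. Below is a small check with made-up numbers (not real player stats).

```r
library(dplyr)

# Made-up shooting lines, for illustration only.
toy_shots <- tibble(
  Name  = c("Player A", "Player B"),
  FGM   = c(10, 3),   # all made field goals (twos and threes)
  `3PM` = c(4, 0),    # made threes, already counted inside FGM
  FTM   = c(5, 2)     # made free throws
)

# Every field goal is worth at least 2; each three adds 1 more; each free throw adds 1.
toy_points <- toy_shots |>
  mutate(PTS = 2 * FGM + `3PM` + FTM)

toy_points
```

Player A has 6 twos (12 points), 4 threes (12 points), and 5 free throws (5 points), which matches the 29 from the formula. Two teams with the same FGM can therefore have quite different PTS.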

summary_iowa2 <- iowa_stats |> 
  arrange(desc(MIN)) |>
  slice(1:10) |>
  summarize(fg_made = mean(FGM))

summary_south_carolina2 <- sc_stats |>
  arrange(desc(MIN)) |>
  slice(1:10) |>
  summarize(fg_made = mean(FGM))

combined_summary2 <- bind_rows(
  mutate(summary_iowa2, Team = "Iowa"),
  mutate(summary_south_carolina2, Team = "South Carolina"))

# Print in a separate step so combined_summary2 keeps the data frame
pander(combined_summary2)

From this table, we can see that this time South Carolina has the higher number, but not by much (less than 1). This tells us that although Iowa has a higher average point total, the two teams are very similar in average field goals actually made. One explanation for the first table could be that Iowa makes more high-value shots, so I want to look at that next.
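One caveat: FGM is a count of made shots, so it shows volume rather than accuracy. Shooting accuracy would be field goal percentage, FGM divided by FGA. Here is a minimal sketch of how that could be computed for the top players by minutes, using made-up numbers rather than the scraped data.

```r
library(dplyr)

# Made-up shooting totals, for illustration only.
toy_fg <- tibble(
  Name = c("Player A", "Player B", "Player C"),
  MIN  = c(900, 800, 100),
  FGM  = c(200, 150, 10),
  FGA  = c(400, 350, 30)
)

# Field goal percentage for the top two players by minutes.
# Summing makes and attempts first weights each player by how much they shoot.
toy_fg_pct <- toy_fg |>
  arrange(desc(MIN)) |>
  slice(1:2) |>
  summarize(fg_pct = sum(FGM) / sum(FGA))

toy_fg_pct
```

Dividing the sums, rather than averaging each player's own percentage, keeps a low-volume shooter from counting as much as a high-volume one.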

summary_iowa3 <- iowa_stats |> 
  arrange(desc(MIN)) |>
  slice(1:10) |>
  summarize(`3s_made` = mean(`3PM`))

summary_south_carolina3 <- sc_stats |>
  arrange(desc(MIN)) |>
  slice(1:10) |>
  summarize(`3s_made` = mean(`3PM`))

combined_summary3 <- bind_rows(
  mutate(summary_iowa3, Team = "Iowa"),
  mutate(summary_south_carolina3, Team = "South Carolina"))

# Print in a separate step so combined_summary3 keeps the data frame
pander(combined_summary3)

From this, we can see something that I had a feeling about from the previous tables. Iowa has made almost twice as many three-pointers as South Carolina. This tells us that the gap in average points between the teams is larger because Iowa simply makes higher-value shots more often than South Carolina.

Next, out of curiosity after seeing the first few numbers, I want to compare each team’s season points by player to see whether any outliers on either team are skewing these averages.

(Instead of looking at the top ten players by minutes, we will look at all the players on each team to get a better comparison as a whole.)

library(plotly)

iowa_ppg <- iowa_stats |> select(Name, PTS) |>
  arrange(desc(PTS)) |>
  # skip row 1 after sorting (presumably the team total row) and keep the players
  slice(2:13) |>
  mutate(Name = fct_reorder(Name, PTS))

iowa_plot1 <- ggplot(data = iowa_ppg, aes(x = Name,
                              y = PTS,
                              label = PTS)) +
  geom_point(color = "black") +
  geom_segment(aes(x = Name, xend = Name, y = 0, yend = PTS), color="yellow") +
  coord_flip() +
  theme(plot.background = element_rect(fill = "grey1"),
        axis.text = element_text(colour = "yellow", size = rel(1)))

ggplotly(iowa_plot1, tooltip = "label")

Same thing for South Carolina…

sc_ppg <- sc_stats |> select(Name, PTS) |>
  arrange(desc(PTS)) |>
  # skip row 1 after sorting (presumably the team total row) and keep the players
  slice(2:12) |>
  mutate(Name = fct_reorder(Name, PTS))

sc_plot1 <- ggplot(data = sc_ppg, aes(x = Name,
                          y = PTS,
                          label = PTS)) +
  geom_point(color = "black") +
  geom_segment(aes(x = Name, xend = Name, y = 0, yend = PTS), color="red4") +
  coord_flip() +
  theme(plot.background = element_rect(fill = "grey1"),
        axis.text = element_text(colour = "red3", size = rel(1)))

ggplotly(sc_plot1, tooltip = "label")
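Both plot data sets above skip the first row after sorting by points, using `slice(2:13)` and `slice(2:12)`. If that first row is a team total, an alternative is to drop it by name so the code does not depend on roster size. This is only a sketch: the label `"Total"` is an assumption about how the row is named, and the numbers are made up.

```r
library(dplyr)
library(forcats)

# Made-up table with a totals row, for illustration only.
toy_roster <- tibble(
  Name = c("Player A", "Player B", "Total"),
  PTS  = c(300, 120, 420)
)

# Drop the totals row by its (assumed) label, then order players for plotting.
toy_players <- toy_roster |>
  filter(Name != "Total") |>
  arrange(desc(PTS)) |>
  mutate(Name = fct_reorder(Name, PTS))

toy_players
```

The same filter would work for both teams without changing the row range for each roster.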

From analyzing these two plots, we see something really interesting. Right off the bat, South Carolina looks like it has more even scoring between players, with a smooth decreasing trend from the top scorer. Iowa, on the other hand, seems to have an outlier right at the top. Caitlin Clark (1,234) has 724 more points than the next-best scorer on the team (510), and that gap alone is larger than the season total of South Carolina’s top scorer (474).

This clears up some of the gray area we had with the average points comparison between teams. Caitlin Clark is an obvious outlier here, even when looking at both teams.
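The arithmetic behind that comparison can be checked directly with the three point totals quoted above. The variable names are mine, chosen for illustration.

```r
# Season point totals as quoted in the text above.
clark_pts       <- 1234  # Iowa's top scorer
iowa_second_pts <- 510   # Iowa's next-best scorer
sc_top_pts      <- 474   # South Carolina's top scorer

# Gap between Iowa's top two scorers.
scoring_gap <- clark_pts - iowa_second_pts
scoring_gap

# Is that gap alone larger than South Carolina's top scorer's total?
gap_exceeds_sc_top <- scoring_gap > sc_top_pts
gap_exceeds_sc_top
```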

Next…